Introduction to R and RStudio

The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course, both to learn the statistical concepts discussed in class and also to analyze real data and come to informed conclusions. To straighten out which is which: R is the name of the programming language itself and RStudio is a convenient interface. This is going to feel very unfamiliar. The more questions you ask me, the quicker you will get up to speed on the software. These labs will build on each other. You will want to refer back to previous labs often to remind yourself how to perform certain tasks.

As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with an introduction to some of the fundamental building blocks of R and RStudio: the interface, reading in data, basic commands, data types, and visualization.

Open RStudio and create a new Markdown document (File|New File|R Markdown, or use the icon dropdown menu on the upper left). Save this to wherever you are going to keep your labs. I suggest having a separate folder for each lab, since multiple files are generated. When you Knit, you might get an error about packages; just install any that are needed, and ask for help if you get further errors. You can, and should, delete all the pre-loaded text and code UNDER the first R chunk (the setup chunk should stay).

The lab contains “Exercises”, and you should make a separate header for each exercise (e.g. type ## Exercise 1). So your Markdown document will typically have the following flow:

A header titled “Exercise 1”, for example
Some text beneath, answering any written questions in the exercise or explaining your solution
A code chunk where you write the code necessary to complete the exercise and support your answers

Not all exercises will require that you write text, and a few won’t require that you include R code. But if you need output from R to answer a question, that code must be included in your Markdown document.

The panel in the upper right contains your workspace (aka environment) as well as a history of the commands that you’ve previously entered. Any plots that you generate will show up in the panel in the lower right corner.

The panel on the lower left is where the action happens. It’s called the console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request, a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.

You can use R as a calculator. To get you started, enter the following command at the R prompt (i.e. right after > on the console). You can either type it in manually or copy and paste it from this document.

2+2

You can also save this result to an object that you can access later:

x <- 2+2

The arrow <- is called an ASSIGNMENT OPERATOR, and tells R to save an object called x that has the value of 4. This is similar to saving a value in a graphing calculator. There is a keyboard shortcut for this; it’s easy to use Google to find keyboard shortcuts (they will be different for Macs vs. PCs). There is also a keyboard shortcut menu in the Help menu in RStudio. I would recommend learning the shortcut for <- first; it’s annoying to type and you’ll use it constantly.

Note that whatever name you want to save your object as must always be to the left of the assignment operator. You can also see this new object in your environment on the upper right pane.

Try typing x in the console to verify its value.
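As a small sketch of the assign-then-inspect pattern described above (the multiplication at the end is an extra illustration, not part of the lab steps):

```r
# Assign the result of 2 + 2 to an object named x
x <- 2 + 2

# Typing the object's name prints its stored value
x        # 4

# The stored value can be reused in later calculations
x * 10   # 40
```

Note that the second calculation does not change x; only an assignment with <- would do that.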

Throughout the semester you will learn about how to use R to do data analysis, and in the meantime you will be exposed to some programming. In addition, you will learn best practices for saving your code and making sure that your analysis is reproducible.

Creating a reproducible lab report

When you want to write a paper, you might open a Word document to type your ideas into, and save your work in. In RStudio we use a document type called an R Markdown document. R Markdown documents are useful for both running code and annotating the code with comments. The document can be saved, so you can refer back to your code later, and can be used to create other document types (HTML, Word, PDF, or slides) for presenting the results of your analyses. R Markdown provides a way to generate clear and reproducible statistical analyses. In an R Markdown document, you can write text just like in Word, but you can also put ‘chunks’ of code in the document. Then you will ‘Knit’ the Markdown file, which runs your code and creates a document containing your text along with the results of that code.

You’ll need to figure out whether code is needed to answer a particular question, and if so a new chunk of code can be inserted by clicking on the Insert button and choosing R from the dropdown menu. Again, there is a keyboard shortcut for this.

If you have pop-ups blocked on your laptop, you may see a box come up warning you when you Knit. Just click Try Again and you should see the results. You can also choose to see your output in the Viewer tab. You can change this setting by clicking the gear icon next to the Knit button. As you go along, you will discover that you have certain preferences (e.g. you might always want your knitted documents to display in the Viewer tab rather than a separate window). Tools|Global Options gives you a lot of ways to customize your RStudio experience that will hold for all R sessions (so you don’t always have to change options manually each time you make a new Markdown document). I would suggest spending a few minutes exploring this menu; ask me if you have any preferences you’d like to change and don’t see how to do that.

R packages

R is an open-source programming language, meaning that users can contribute packages that make our lives easier, and we can use them for free. For this lab, and many others in the future, we will use the following R packages:

tidyverse: for data wrangling and visualization
openintro: for the datasets used in the R labs in this course

Even if these packages have already been installed on your laptop, you will still need to load them in your working environment in order to use them. To do so, type the following in the console (since you have opened a Markdown document, you might need to click on the Console tab at the bottom left of your screen and resize the window to see the Console better - you should experiment with resizing the four panes to your liking):

library(tidyverse)
library(openintro)

Note that these lines of code need to also appear at the top of your R Markdown document (I often put them in the code chunk labeled ‘setup’ that is automatically inserted when you make a new Markdown document). We need to load the packages both in the console and in your R Markdown document since these two environments work independently of each other.
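If you are not sure whether the two packages are installed yet, one possible check is sketched below. The object names needed and missing are placeholders chosen for this sketch, and the install line is left commented out so that nothing is installed unless you choose to.

```r
# Packages this lab relies on
needed <- c("tidyverse", "openintro")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing <- needed[!is_installed]

if (length(missing) > 0) {
  message("Not installed yet: ", paste(missing, collapse = ", "))
  # install.packages(missing)  # uncomment and run once to install them
} else {
  message("Both packages are installed; library() should work.")
}
```

Installing is a one-time step per computer, whereas library() has to be run in every new session and at the top of every Markdown document.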

A note on the difference between the console and your Markdown document: Think of the console as your sandbox. You can try out code there and see what happens. But your final product is your Markdown document, and they are separate. When R Knits, it creates a new environment, starts at the top of your Markdown document, and runs all the code in order. So any code you need must be in your Markdown document. That’s also why, when you want to play around in the console (your sandbox), you also have to make sure you have everything loaded there, as well (like your packages). Fortunately, you can tell R to run code from your Markdown document INSIDE the console, so you don’t have to type everything twice. We’ll get to that next.

The data: Dr. Arbuthnot’s baptism records

To get you started, make a new code chunk and type and run the following command from your Markdown file.

data("arbuthnot")

You can do this by

clicking on the green arrow at the top right of the code chunk in the R Markdown (.Rmd) file, or
putting your cursor on this line, and hitting the Run button on the upper right corner of the pane, or
typing the code in the console.

Think of “running code” in your console as telling R to “do this now”.

This command instructs R to load some data: the Arbuthnot baptism counts for boys and girls. You should see that the workspace area in the upper righthand corner of the RStudio window now lists a data set called arbuthnot that has 82 observations on 3 variables.

A note on Promise: When you load this dataset, you will see the word ‘Promise’ show up in the environment. This is because R doesn’t want to take up memory until you actually need to DO something with this dataset. If you double-click the name ‘arbuthnot’ the data will open in a Viewer AND you’ll see the word ‘Promise’ change to the number of observations and variables.

This dataset is contained in the openintro library. As you interact with R, you will create a series of objects. Sometimes you load them as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed.

The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th-century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710. We can take a look at the data by typing its name into the console.

arbuthnot

An advantage of RStudio is that it comes with a built-in data viewer. Click on the name arbuthnot in the Environment pane (upper right window) that lists the objects in your workspace. This will bring up an alternative display of the data set in the Data Viewer (upper left window). You can close the data viewer by clicking on the x in the tab above the data.

What you should see are four columns of numbers, each row representing a different year: the first entry in each row is simply the row number (an index we can use to access the data from individual years if we want), the second is the year, and the third and fourth are the numbers of boys and girls baptized that year, respectively. Use the scrollbar on the right side of the console window to examine the complete data set.

Note that the row numbers in the first column are not part of Arbuthnot’s data. R adds them as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored Arbuthnot’s data in a kind of spreadsheet or table called a data frame (or tibble).

You can see the dimensions of this data frame by typing (in the console):

dim(arbuthnot)

This command should output [1] 82 3, indicating that there are 82 rows and 3 columns (we’ll get to what the [1] means in a bit), just as it says next to the object in your workspace. You can see the names of these columns (or variables) by typing (in the console):

names(arbuthnot)

You should see that the data frame contains the columns year, boys, and girls. At this point, you might notice that many of the commands in R look a lot like math functions; that is, invoking R commands means supplying a function with some number of arguments. The dim and names commands, for example, each took a single argument, the name of a data frame.
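To see the same two functions on something small enough to check by eye, here is a sketch that uses a made-up data frame. The name toy and all of its values are invented for illustration; they are not Arbuthnot’s counts.

```r
# A tiny made-up data frame with the same column names as arbuthnot
# (values are invented for illustration only)
toy <- data.frame(
  year  = c(2001, 2002, 2003),
  boys  = c(10, 12, 11),
  girls = c(9, 13, 10)
)

# One argument in, one result out, just like a math function
dim(toy)    # 3 3  (rows, then columns)
names(toy)  # "year" "boys" "girls"
```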

Vectors

Let’s start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like

arbuthnot$boys

This command will only show the number of boys baptized each year. The dollar sign basically says “go to the data frame that comes before me, and find the variable that comes after me”. It’s a way to reference a particular column.

What command would you use to extract just the counts of girls baptized? Try it! (Enter your answer in your R Markdown document and run the entire report by hitting Knit HTML. Now the R output you need is already in your report.)

Notice that the way R has printed these data is different. When we looked at the complete data frame, we saw 82 rows, one on each line of the display. These data are no longer structured in a table with other variables, so they are displayed one right after another. Objects that print out in this way are called vectors; they represent a set of numbers. R has added numbers in [brackets] along the left side of the printout to indicate locations within the vector. For example, 5218 follows [1], indicating that 5218 is the first entry in the vector (for the boys). And if [43] starts a line, then that would mean the first number on that line would represent the 43rd entry in the vector.

Technically, a vector is a data object that has multiple elements of the same type. So far, you have created one vector called x that has one element in it. That element’s type is numeric.

Copy, paste and run the following command into the console.

v <- c(2, 4, 6)

This vector contains three numbers, 2, 4, and 6. The c() function tells R to combine (concatenate) the values 2, 4, and 6 into a single vector. Note in the Environment pane that your vector v contains numbers (listed as num). In other words, its type is numeric.

You can do math on a vector that contains numbers! For instance, copy, paste and run the following command into the console. This tells R to multiply each element of the vector v by 3.

v * 3
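The sketch below collects a few things you can try with v. Using square brackets to pull out a single entry is an addition here (the lab has not introduced it yet), but it matches the bracketed position numbers that R prints next to a vector.

```r
v <- c(2, 4, 6)

# Arithmetic is applied to every element of the vector
v * 3       # 6 12 18
v + 1       # 3 5 7

# How many elements does the vector have?
length(v)   # 3

# Square brackets pick out the entry at a given position
v[2]        # 4
```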

You can also make vectors of characters (words or strings).

Copy, paste and run the following command into a new code chunk. This vector has the name char.vec and contains 3 elements, all of which are designated as characters (or chr in the Environment pane). It doesn’t matter that “2” is a number; putting the elements in quotes tells R that they are all character data types, not numeric.

char.vec <- c("2", "Wheaton", "red")
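If you want to confirm what type R has assigned, one option is the class() function, sketched below. The as.numeric() conversion at the end is an extra illustration that goes slightly beyond the lab.

```r
char.vec <- c("2", "Wheaton", "red")

# class() reports the type of data stored in a vector
class(char.vec)       # "character"
class(c(2, 4, 6))     # "numeric"

# "2" in quotes is text, so "2" * 3 would stop with an error.
# Converting it to a number first makes arithmetic possible:
as.numeric("2") * 3   # 6
```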

Create a character vector called new.vec which contains your full name, with each word as a separate element of the vector.

Data visualization

R has some powerful functions for making graphics. We will use the ggplot function for data visualization. Its first argument is the data you’re visualizing. Next we define the aesthetic mappings. In other words, the columns of the data that get mapped to certain aesthetic features of the plot, e.g. the x axis will represent the variable called year and the y axis will represent the variable called girls. Then, we add another layer to this plot where we define which geometric shapes we want to use to represent each observation in the data. In this case we want these to be points, hence we use geom_point.

ggplot(data = arbuthnot, mapping = aes(x = year, y = girls)) +
  geom_point()

If this seems like a lot, it is. And you will learn about the philosophy of building data visualizations in layers in detail soon. For now, follow along with the code that is provided.
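To make the idea of layers a little more concrete, here is a sketch on a made-up data frame (toy and its values are invented for illustration). The labs() layer for axis labels is an addition that this lab does not require; it is shown only as one more example of adding a layer with +.

```r
library(ggplot2)  # also loaded when you run library(tidyverse)

# Made-up data for illustration only
toy <- data.frame(year = c(2001, 2002, 2003), girls = c(9, 13, 10))

# Layer 1: the data and the aesthetic mappings
# Layer 2: points, one per row of the data
# Layer 3: readable axis labels
p <- ggplot(data = toy, mapping = aes(x = year, y = girls)) +
  geom_point() +
  labs(x = "Year", y = "Girls (made-up counts)")

p  # printing the object draws the plot
```

Each + must come at the end of a line, not the start of the next one, so that R knows the command continues.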

Change the look of your report:

Click on the gear icon at the top of the R Markdown document, and select “Output Options…” in the dropdown menu. In the General tab of the pop-up dialogue box, try out different Syntax highlighting and theme options. Hit OK and Knit your document to see how it looks. Play around with these until you’re happy with the look.

Getting help:

R extensively documents all of its functions; to read what a function does and learn the arguments that are available to you, just type in a question mark followed by the name of the function that you’re interested in.

Try the following:

?dim

Notice that the help file replaces the plot in the lower right panel. You can toggle between plots and help files using the tabs at the top of that panel.

Tip: If you use the up and down arrow keys, you can scroll through your previous commands in the console, your so-called command history. You can also access it by clicking on the history tab in the upper right panel. This will save you a lot of typing in the future.

Use the up arrow to retrieve the last ggplot command and change geom_point to geom_line so that instead of having to type the entire line over again you can use the previously run code. Is there an apparent trend in the number of girls baptized over the years? How would you describe it? (To ensure that your lab report is comprehensive, be sure to include the code needed to make the plot as well as your written interpretation.)

Data mutation

Now, suppose we want to plot the total number of baptisms. We can type in mathematical expressions like

5218 + 4683

to see the total number of baptisms in 1629. We could repeat this once for each year, but there is a faster way. If we add the vector for baptisms for boys to that of girls, R will compute all sums simultaneously. Type the following in the console.

arbuthnot$boys + arbuthnot$girls

What you will see are 82 numbers (in that packed display, because we aren’t looking at a data frame here), each one representing the sum we’re after. Take a look at a few of them and verify that they are right. If you add two vectors of numbers together that are the exact same size (i.e. they have the same number of elements), R will add component-wise. In other words, the first elements of each vector will be added together, the second elements of each vector will be added together, etc. You will get a new vector of the same size (or length) as the original two, but with all the sums.
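Component-wise arithmetic is easier to verify on short vectors. In the sketch below, first and second are made-up vectors used only for illustration.

```r
# Two made-up vectors of the same length
first  <- c(1, 2, 3)
second <- c(10, 20, 30)

# Entries in matching positions are combined
first + second   # 11 22 33

# The same rule applies to other operators, such as division
second / first   # 10 10 10
```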

We’ll be using this new vector to generate some plots, so we’ll want to save it as a permanent column in our data frame. Type the following in the console.

arbuthnot <- arbuthnot %>% mutate(total = boys + girls)

What in the world is going on here? The %>% operator is called the piping operator. Basically, it takes whatever is to its left and pipes it into the first argument of the function on its right. The %>% operator also has a keyboard shortcut you might want to learn earlier rather than later. We will use it a lot.

A note on piping: Note that we can read this code as the following:

“Take the arbuthnot dataset and pipe it (as the first argument) into the mutate function. Use mutate to create a new variable called total that is the sum of the variables called boys and girls. Then assign the resulting dataset to the object called arbuthnot, i.e. overwrite the old arbuthnot dataset with the new one containing the new variable.”

This is essentially equivalent to going through each row and adding up the boys and girls counts for that year and recording that value in a new column called total.
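One way to convince yourself of what the pipe does is to write the same command with and without it. The sketch below does this on a made-up data frame; toy, piped, and direct are names invented for this illustration, and dplyr is the tidyverse package that provides mutate.

```r
library(dplyr)  # also loaded when you run library(tidyverse)

# Made-up data for illustration only
toy <- data.frame(year = c(2001, 2002), boys = c(10, 12), girls = c(9, 13))

# With the pipe: toy is passed in as mutate's first argument
piped <- toy %>% mutate(total = boys + girls)

# Without the pipe: toy is written as the first argument directly
direct <- mutate(toy, total = boys + girls)

identical(piped, direct)  # TRUE
piped$total               # 19 25
```

The piped form tends to read more naturally once several steps are chained together, which is why the labs use it.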

Where is the new variable? When you make changes to variables in your dataset, click on the name of the dataset again (in the Environment Tab) to update it in the data viewer.

You’ll see that there is now a new column called total that has been tacked on to the data frame. The special symbol <- performs an assignment, taking the output of one line of code and saving it into an object in your workspace. In this case, you already have an object called arbuthnot, so this command updates that data set with the new mutated column.

We can make a plot of the total number of baptisms per year with the command (type all code below in the console)

ggplot(data = arbuthnot, mapping = aes(x = year, y = total)) +
  geom_line()

Similarly to how we computed the total number of baptisms, we can compute the ratio of the number of boys to the number of girls baptized in 1629 with

5218 / 4683

or we can act on the complete columns with the expression

arbuthnot <- arbuthnot %>% mutate(boy_to_girl_ratio = boys / girls)

We can also compute the proportion of newborns that are boys in 1629

5218 / (5218 + 4683)

or we can compute it for all years simultaneously and append it to the dataset:

arbuthnot <- arbuthnot %>% mutate(boy_ratio = boys / total)

Note that we are using the new total variable we created earlier in our calculations.

Now, generate a plot of the proportion of boys born over time (Run this code from your Markdown document). Note that you will need code to both create the boy_ratio column (see above) and code to make the plot. Everything you need must be in your Markdown document. The reason is that when you Knit, the document is created from scratch and ONLY runs the code in the Markdown document. So if you need the total column to make a new variable, you need to have put the code to make the total variable in your Markdown document. It doesn’t matter that you’ve already done this in the console. Knit to make sure everything is working. What do you see? Enter your answer in your Markdown document underneath your code chunk.

Finally, in addition to simple mathematical operators like subtraction and division, you can ask R to make comparisons like greater than, >, less than, <, and equality, ==. For example, we can ask if boys outnumber girls in each year with the expression

arbuthnot <- arbuthnot %>% mutate(more_boys = boys > girls)

This command adds a new variable to the arbuthnot data frame containing the value TRUE if that year had more boys than girls, or FALSE if it did not (the answer may surprise you). This variable contains a different kind of data than we have considered so far. All other columns in the arbuthnot data frame have values that are numerical (the year, the number of boys and girls). Here, we’ve asked R to create logical data, data where the values are either TRUE or FALSE. In general, data analysis will involve many different kinds of data types, and one reason for using R is that it is able to represent and compute with many of them.
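Comparisons work element by element, just like addition. The sketch below uses made-up vectors (boys_toy and girls_toy are invented for illustration). The sum() line relies on the fact that R counts TRUE as 1 and FALSE as 0, which is a small addition to what the lab covers.

```r
# Made-up counts for illustration only
boys_toy  <- c(10, 12, 11)
girls_toy <- c(9, 13, 11)

# Each pair of entries is compared, giving a logical vector
more <- boys_toy > girls_toy
more                    # TRUE FALSE FALSE
class(more)             # "logical"

# == tests equality (a single = would not do this)
boys_toy == girls_toy   # FALSE FALSE TRUE

# Counting the TRUE values
sum(more)               # 1
```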

More exercises

There has been some confusion about this section in the past, so just to be clear, yes, these “More exercises” are assigned and part of the grade.

In the previous few pages, you recreated some of the displays and preliminary analysis of Arbuthnot’s baptism data. Your assignment involves repeating these steps, but for present day birth records in the United States. Load up the present day data with the command below. Any code you use to answer the questions below should be entered into your R Markdown file by first inserting code chunks and then typing the code you need inside the chunks. Any answers to questions should be entered OUTSIDE the code chunks. Please enter all code and text underneath the respective question number headings.

data(present)

The data are stored in a data frame called present.

What years are included in this data set? What are the dimensions of the data frame? What are the variable (column) names?
How do these counts compare to Arbuthnot’s? Are they on a similar scale? Why or why not?
Make a plot that displays the proportion of boys born over time. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response. Hint: You should be able to reuse your code from an exercise above, just replace the dataframe name.
In what year did we see the highest total number of births in the U.S.? Hint: Sort your dataset in descending order based on the total column you created for the previous question. You can do this interactively in the data viewer by clicking on the arrows next to the variable names. To include the sorted results in your report you will need to use two new functions: arrange (for sorting) and desc (for descending order). Sample code provided below.

present %>% arrange(desc(total))

Complete the following with code in a code chunk (no text necessary). Remember that the code is just instructions for R.
- Create a variable called y with the value of 7 and a variable x with a value of 8.
- Multiply x by y, and store the answer in a variable named z.
- Run the following mathematical operation in a code chunk: 6 + 3
- Where does the answer appear? (please answer with text)
- Now add a code chunk, and save the results of 6 + 3 as a variable called a.
- Does the answer appear in your Knitted document? (please answer with text)
- Why do you think this is?
- Click the little broom icon in the upper right hand corner of the Environment pane. Click yes on the window that opens.
- What happened? (please answer with text, and don’t freak out)
- Go to the Run button at the top right of the R Markdown pane, and choose Run All (the last option)
- What happened? (please answer with text)

Your file will save automatically when you Knit, and your final lab must Knit without error before you submit, so Knit often! You want to figure out if you have errors in your Markdown document as you go along, not at the end.

Now let’s practice some basic formatting. Using this formatting tips page, figure out how to put the following into your lab report. These can all be typed into the white section, where text goes. Hint: To put each of these on its own line, hit a hard return (an extra one) between each line of text.
- Italicize like this
- Bold like this
- A superscript: R²

These data come from reports by the Centers for Disease Control listed in the references section.

Turning in the Lab

When you are finished with the lab, go to the very top and change the output from html_document to pdf_document. The PDF document doesn’t look as nice, but it is easier to grade and upload to Schoology. Now turn in this PDF file to Schoology. Note the due date and time. If Schoology says it’s late, it’s late. Make sure your final Markdown document Knits properly and shows all your work. Look through it to make sure everything looks organized and professional. Also remember that if you needed output (graphs, numeric output, etc.) to answer a question, the code to generate that output needs to be in the lab report. Other code should not be included.